The RBraLymP (RNA-based Brain Lymphoma Profiler) is a diagnostic algorithm takes gene expression data from either formalin-fixed, paraffin-embedded (FFPE) or fresh-frozen (FF) tissue for identifying the PCNSL molecular subtypes (CS1 to CS4) associated with multi-omic features and clinical outcome. The multi-omic features include information relative to mutations, copy-number alterations (CNA), fusions, gene expression, TCR/BCR clonotypes, tumor microenvironment (TME), methylation, and tumor localization. The characteristic multi-omic features for each significant cluster (CS) group is described in Hernandez-Verdin, I. et al. Molecular and clinical diversity in primary central nervous system lymphoma. Annals of Oncology, 2022 but generally speaking patients with the “immune-hot” CS4 profile had the most favorable clinical outcome followed by the CS1 group with mutations leading to increased proliferation. Patients belonging to CS2/CS3 groups had the worst outcomes, mainly related to tumor location and/or their TME. The present code provides takes an RNA expression matrix (gene names as rows; samples as columns) as input from either FFPE or FF tissue to output a two column matrix with the “Sample ID” and the “Cluster significant [CS] group”. An example on how to use DegNorm to correct transcript degradation due to the parrafin fixation process, if data is coming from FFPE tissue, is also given. Finally, downstream analysis for evaluating clinical impact using univariate-multivariate models after obtaining the CS groups is covered.
A complete list of packages and versions used to compile this vignette is available at the end. A guide on how to install the base packages is provided Here.
library(MOVICS)
library(Hmisc)
library(DESeq2)
#Only if your input data is coming from FFPE tissue
library(DegNorm)
We will download the data from github. Make sure you have internet connection.
download.file("https://github.com/iS4i4S/PCNSL-RBraLymP/blob/main/Data/RBraLymP.RData", destfile = "RBraLymP.RData", method="libcurl")
load("RBraLymP.RData")
## Take a look at gene expression data. Please, note that Gene_names (as Hugo_symbol) are the rows and samples the columns.
#Example of FF data
print(head(Test_FF_PCNSL))
#Example of FFPE data
print(head(Test_FFPE_PCNSL))
The methodology of RBraLymP can be summarized as follows:
Determine if your PCNSL RNA-seq data comes from FFPE or FF tissue.
Obtain a normalized gene expression matrix with genes as Hugo Symbols.
Apply RBraLymP to your normalized gene expression data to identify the PCNSL molecular subtypes.
Import your clinical data and estimate uni-multivariate survival models.
When using gene expression data coming FFPE tissue it is highly recommended to correct RNA reads from transcript degradation due to the paraffin fixation process as described in the Supplementary Appendix from our article. This processes is performed by the R package DegNorm which takes bam files as input (Aligned FASTQ files) and gtf file used for the alignment. The following code is a minimal example edited from DegNorm Vignette.
## specify bam_files from RNA-seq, you should replace it by your own bam files
bam_file_list=list.files(path=system.file("extdata",package="DegNorm"),
pattern=".bam$",full.names=TRUE)
#The three bam files were subsetted from a specific region of chromosome 21 from the origianl bam for package size limitation. Original files can be found from the included reference above.
## gtf_file you used for RNA-seq alignment, replace it by your own gtf file
gtf_file=list.files(path=system.file("extdata",package="DegNorm"),
pattern=".gtf$",full.names=TRUE)
## calculate the read coverage score for all genes of all samples
coverage_res_chr21_sub=read_coverage_batch(bam_file_list, gtf_file,cores=6)
#Note: This process might take long times when using all chromosomes, use a high number of cores or subset the number of samples
summary_CoverageClass(coverage_res_chr21_sub)
res_DegNorm_chr21 = degnorm(read_coverage = coverage_res_chr21_sub[[1]],
counts = coverage_res_chr21_sub[[2]],
iteration = 5,
down_sampling = 1,
grid_size=10,
loop = 100,
cores=8)
##You can visualize the results of this example by running:
data("res_DegNorm_chr21")
summary_DegNormClass(res_DegNorm_chr21)
## extrac normalized read counts
counts_normed=res_DegNorm_chr21$counts_normed
head(counts_normed)
## SRR873822_chr21.bam SRR873834_chr21.bam SRR873838_chr21.bam
## AATBC 9 10 15
## ADAMTS1 970 938 837
## ADAMTS5 87 92 90
## ADARB1 1440 1735 1915
## AGPAT3 2627 2800 2855
## APP 16794 17508 19225
We recommend the Variance stabilizing transformation from DESeq2 package to normalize read counts. You can consult the full documentation about this function Here. Following with the previous example:
#Apply vst function to the DegNorm expression data: counts_normed
#Create a metaData
tmp_metaData<-data.frame(Sample=colnames(counts_normed),Condition=c(rep("Test",2),"Test2"))
DESEq_FFPE <- DESeqDataSetFromMatrix(countData = counts_normed, #The DegNormalized gene expression data,
colData = tmp_metaData,
design = ~ Condition)
dds_tmp <- DESeq(DESEq_FFPE)
vsd_tmp <- vst(dds_tmp, blind=FALSE)
#If Error in vst(dds_tmp, blind = FALSE) : less than 'nsub' rows,
#it is recommended to use varianceStabilizingTransformation directly
#This is because the number of genes is to low...
deseq_vst<-assay(vsd_tmp)
deseq_vst<-as.data.frame(deseq_vst)
head(deseq_vst) ##This is the needed input for RBraLymP
Now we are ready to use RBraLymP on the normalized gene expression FFPE data!!!
Since the previous example was not PCNSL data, here we are going to work with the Test_FFPE_PCNSL which contains gene expression data from 4 samples of our article (each from one different CS group).
##Check the data
head(Test_FFPE_PCNSL)
## Sample1 Sample2 Sample3 Sample4
## DDX11L1 4.271258 4.271258 4.271258 4.271258
## MIR6859-1 4.271258 4.271258 4.271258 4.271258
## WASH7P 4.271258 4.271258 4.271258 4.271258
## MIR1302-2 4.271258 4.271258 4.271258 4.271258
## MIR1302-2HG 4.271258 4.271258 4.271258 4.271258
## FAM138A 4.271258 4.271258 4.271258 4.271258
Now use the RBraLymP function:
##data_type can be either FFPE or FF; for FFPE data "doPlot" option is also avaliable
PCNSL_CS_groups<-RBraLymP(data_type = "FFPE",input_expression = Test_FFPE_PCNSL,doPlot = TRUE)
## --362/400 biomarkers found in input data.
## --correct number of biomarkers matched.
## --original template has 400 biomarkers and 362 are matched in external expression profile.
## cosine correlation distance
## 206/362 features NA, discarded
## 206/362 templates features not in emat, discarded
## 4 samples; 4 classes; 8-68 features/class
## serial processing; 1000 permutation(s)...
## predicted samples/class (FDR<0.05)
##
## CS1 CS2 CS3 CS4 <NA>
## 1 1 0 1 1
## --Done!!
##Check the results
head(PCNSL_CS_groups$ntp.res)
## prediction d.CS1 d.CS2 d.CS3 d.CS4 p.value FDR
## Sample1 CS4 0.7260 0.7218 0.6645 0.4468 0.0010 0.0040
## Sample2 CS3 0.6640 0.6629 0.6181 0.7052 0.1968 0.1968
## Sample3 CS1 0.6677 0.7540 0.7805 0.8178 0.0100 0.0200
## Sample4 CS2 0.7681 0.6754 0.7406 0.7923 0.0340 0.0453
head(PCNSL_CS_groups$clust.res)
## samID clust
## Sample1 Sample1 4
## Sample2 Sample2 3
## Sample3 Sample3 1
## Sample4 Sample4 2
If data is coming from FF data, DegNorm DO NOT need to be used. Instead only VST normalization of RNA reads is recommended. Considering RNA reads are already normalized, the code is very straightforward:
head(Test_FF_PCNSL)
## Sample1 Sample2 Sample3 Sample4
## TSPAN6 7.921814 11.216060 8.553405 9.537967
## TNMD 6.093985 6.733163 6.093985 6.458508
## DPM1 12.502109 10.743858 12.497105 11.897794
## SCYL3 11.151200 10.892821 10.737844 10.951266
## C1orf112 12.291237 10.510212 11.498621 11.374098
## FGR 12.146008 11.664934 12.870950 11.756512
Here we also included a gene expression example of 4 samples from our FF cohort (each from one different CS group).
Now use the RBraLymP function:
PCNSL_CS_groups<-RBraLymP(data_type = "FF",input_expression = Test_FF_PCNSL)
## --2087/2087 biomarkers found in input data.
## --correct number of biomarkers matched.
## --all samples matched.
## --a total of 2087 genes shared and used.
## --training expression profile seems to have been standardised (z-score or log transformation), no more action will be performed.
## --testing expression profile seems to have been standardised (z-score or log transformation), no more action will be performed.
## --Done!!
##Check the results
head(PCNSL_CS_groups$clust.res)
## samID clust
## Sample1 Sample1 1
## Sample2 Sample2 2
## Sample3 Sample3 3
## Sample4 Sample4 4
To visualize the clinical impact of the PCNSL molecular subtypes on a Kaplan-Meier Curve, we use the compSurv function from the MOVICS package. This function not only calculates the overall nominal P value by log-rank test, but also performs pairwise comparison and derives adjusted P values across the four PCNCSL subtypes. The input of the function are: 1) a moic.res, which is the resulting object from running RBraLymP, and 2) a data.frame with clinical information (must has row names of samples) that stores a futime column for survival time (unit of day) and another fustat column for survival outcome (0 = censor; 1 = event).
##Example of needed clinical data
head(Clinic_example)
## fustat futime
## Sample1 0 1750.70
## Sample2 0 549.25
## Sample3 1 298.90
## Sample4 1 1826.95
Run the compSurv function in the FF cohort. Note that the full clinical data (FF cohort) is not provided, it must be downloaded from the EGAS repository: EGAS00001006191.
surv.pcnsl.FF <- compSurv(moic.res = PCNSL_CS_groups,
surv.info = Clinic_example,
convt.time = "m", # convert day unit to month
surv.median.line = "h", # draw horizontal line at median survival
xyrs.est = c(5,10), # estimate 5 and 10-year survival
fig.name = "KAPLAN-MEIER_FF_cohort")
Multivariate analysis can be calculated as follows:
Clinic_example$IK_binary<-c(rep(">=70",4))
Clinic_example$Age_binary<-c(rep(">=70",4))
head(Clinic_example)
#Full clinical data (FF cohort, n=85) is not provided, it must be downloaded from the [EGAS repository]
ggforest( coxph(Surv(futime, fustat) ~ Cluster_consensus+ IK_binary+ Age_binary, data = Clinic_example),data = Clinic_example,fontsize = 1.5)
sessionInfo()
## R version 4.2.2 Patched (2022-11-10 r83330)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fr_FR.UTF-8
## [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 grid stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DegNorm_1.6.1 DESeq2_1.36.0 SummarizedExperiment_1.26.1
## [4] MatrixGenerics_1.8.1 MOVICS_0.99.17 BiocStyle_2.24.0
## [7] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.64.0 rtracklayer_1.56.1
## [10] Biostrings_2.64.1 XVector_0.36.0 maftools_2.12.0
## [13] circlize_0.4.15 clusterProfiler_4.4.4 GSVA_1.44.5
## [16] rmarkdown_2.18 ggbeeswarm_0.6.0 enrichplot_1.16.2
## [19] DOSE_3.22.1 IHW_1.24.0 matrixStats_0.62.0
## [22] survminer_0.4.9 e1071_1.7-12 lawstat_3.5
## [25] ComplexHeatmap_2.12.1 pheatmap_1.0.12 Rcpp_1.0.9
## [28] Hmisc_4.7-1 Formula_1.2-4 survival_3.4-0
## [31] lattice_0.20-45 ggrepel_0.9.2 corrplot_0.92
## [34] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.48.4 GenomicRanges_1.48.0
## [37] GenomeInfoDb_1.32.4 annotate_1.74.0 XML_3.99-0.12
## [40] AnnotationDbi_1.58.0 IRanges_2.30.1 S4Vectors_0.34.0
## [43] Biobase_2.56.0 BiocGenerics_0.42.0 seqinr_4.2-16
## [46] sqldf_0.4-11 RSQLite_2.2.18 gsubfn_0.7
## [49] proto_1.0.0 TCGAutils_1.16.1 forestplot_3.1.0
## [52] abind_1.4-5 checkmate_2.1.0 scales_1.2.1
## [55] binom_1.1-1.1 ggsignif_0.6.4 RColorBrewer_1.1-3
## [58] cowplot_1.1.1 data.table_1.14.4 forcats_0.5.2
## [61] stringr_1.4.1 purrr_0.3.5 readr_2.1.3
## [64] tibble_3.1.8 tidyverse_1.3.2 dplyr_1.0.10
## [67] tidyr_1.2.1 reshape2_1.4.4 ggpubr_0.4.0
## [70] ggplot2_3.4.0
##
## loaded via a namespace (and not attached):
## [1] mitools_2.4 graphlayouts_0.8.3 haven_2.5.1 tcltk_4.2.2
## [5] vctrs_0.5.0 graphite_1.42.0 mgcv_1.8-41 blob_1.2.3
## [9] nloptr_2.0.3 DBI_1.1.3 R.utils_2.12.2 SingleCellExperiment_1.18.1
## [13] rappdirs_0.3.3 jstable_1.0.7 jpeg_0.1-9 zlibbioc_1.42.0
## [17] htmlwidgets_1.5.4 mvtnorm_1.1-3 GlobalOptions_0.1.2 parallel_4.2.2
## [21] irlba_2.3.5.1 markdown_1.3 tidygraph_1.2.2 KernSmooth_2.23-20
## [25] limma_3.52.4 DelayedArray_0.22.0 vegan_2.6-4 magick_2.7.3
## [29] graph_1.74.0 RcppParallel_5.1.5 fs_1.5.2 fastmatch_1.1-3
## [33] Kendall_2.2.1 digest_0.6.30 png_0.1-7 GenomicDataCommons_1.20.3
## [37] bookdown_0.30 scatterpie_0.1.8 ggraph_2.1.0 pkgconfig_2.0.3
## [41] GO.db_3.15.0 gridBase_0.4-7 ggnewscale_0.4.8 DelayedMatrixStats_1.18.2
## [45] minqa_1.2.5 iterators_1.0.14 beeswarm_0.4.0 modeltools_0.2-23
## [49] GetoptLong_1.0.5 xfun_0.34 bslib_0.4.1 zoo_1.8-11
## [53] tidyselect_1.2.0 DNAcopy_1.70.0 labelled_2.10.0 viridisLite_0.4.1
## [57] rlang_1.0.6 jquerylib_0.1.4 glue_1.6.2 mogsa_1.30.0
## [61] registry_0.5-1 modelr_0.1.10 pkgmaker_0.32.2.900 sva_3.44.0
## [65] aricode_1.0.1 svd_0.5.2 labeling_0.4.2 class_7.3-20
## [69] preprocessCore_1.58.0 pamr_1.56.1 corpcor_1.6.10 DO.db_2.9
## [73] webshot_0.5.4 jsonlite_1.8.3 clusterRepro_0.9 bit_4.0.4
## [77] gridExtra_2.3 gplots_3.1.3 Rsamtools_2.12.0 stringi_1.7.8
## [81] ConsensusClusterPlus_1.60.0 survey_4.1-1 rbibutils_2.2.9 yulab.utils_0.0.5
## [85] bitops_1.0-7 cli_3.4.1 Rdpack_2.4 rhdf5filters_1.8.0
## [89] heatmaply_1.4.0 timechange_0.1.1 officer_0.4.4 rstudioapi_0.14
## [93] TSP_1.2-1 GenomicAlignments_1.32.1 nlme_3.1-160 qvalue_2.28.0
## [97] locfit_1.5-9.6 ridge_3.3 coca_1.1.0 gridGraphics_0.5-1
## [101] survMisc_0.5.6 R.oo_1.25.0 dbplyr_2.2.1 entropy_1.3.1
## [105] readxl_1.4.1 lifecycle_1.0.3 commonmark_1.8.1 munsell_0.5.0
## [109] cellranger_1.1.0 R.methodsS3_1.8.2 ggalluvial_0.12.3 caTools_1.18.2
## [113] codetools_0.2-18 MultiAssayExperiment_1.22.0 vipor_0.4.5 ClassDiscovery_3.4.0
## [117] htmlTable_2.4.1 ggpp_0.4.5 xtable_1.8-4 googlesheets4_1.0.1
## [121] BiocManager_1.30.19 lpsymphony_1.24.0 farver_2.1.1 FNN_1.1.3.1
## [125] km.ci_0.5-6 aplot_0.1.8 ggtree_3.4.2 BiocIO_1.6.0
## [129] patchwork_1.1.2 cluster_2.1.4 dendextend_1.16.0 prettyGraphs_2.1.6
## [133] Matrix_1.5-3 tidytree_0.4.1 ellipsis_0.3.2 ExPosition_2.8.23
## [137] prettyunits_1.1.1 CIMLR_1.0.0 lubridate_1.9.0 googledrive_2.0.0
## [141] reprex_2.0.2 mclust_6.0.0 igraph_1.3.5 fgsea_1.22.0
## [145] slam_0.1-50 gargle_1.2.1 htmltools_0.5.3 BiocFileCache_2.4.0
## [149] yaml_2.3.6 NMF_0.30.1 plotly_4.10.1 utf8_1.2.2
## [153] foreign_0.8-83 withr_2.5.0 PINSPlus_2.0.6 BiocParallel_1.30.4
## [157] bit64_4.0.5 rngtools_1.5.2 foreach_1.5.2 GOSemSim_2.22.0
## [161] rsvd_1.0.5 ScaledMatrix_1.4.1 memoise_2.0.1 evaluate_0.18
## [165] geepack_1.3.9 coxme_2.2-18.1 permute_0.9-7 geneplotter_1.74.0
## [169] tzdb_0.3.0 curl_4.3.3 fdrtool_1.2.17 fansi_1.0.3
## [173] highr_0.9 GSEABase_1.58.0 edgeR_3.38.4 SNFtool_2.3.1
## [177] cachem_1.0.6 org.Hs.eg.db_3.15.0 iClusterPlus_1.32.0 interp_1.1-3
## [181] deldir_1.0-6 CMScaller_2.0.1 impute_1.70.0 rjson_0.2.21
## [185] rstatix_0.7.1 oompaBase_3.2.9 ade4_1.7-20 clue_0.3-62
## [189] alluvial_0.1-2 tools_4.2.2 tableone_0.13.2 sass_0.4.2
## [193] magrittr_2.0.3 RCurl_1.98-1.9 proxy_0.4-27 car_3.1-1
## [197] ape_5.6-2 ggplotify_0.1.0 xml2_1.3.3 httr_1.4.4
## [201] assertthat_0.2.1 boot_1.3-28 R6_2.5.1 Rhdf5lib_1.18.2
## [205] nnet_7.3-18 progress_1.2.2 genefilter_1.78.0 KEGGREST_1.36.3
## [209] treeio_1.20.2 gtools_3.9.3 shape_1.4.6 beachmat_2.12.0
## [213] HDF5Array_1.24.2 BiocSingular_1.12.0 rhdf5_2.40.0 splines_4.2.2
## [217] flexclust_1.4-1 carData_3.0-5 oompaData_3.1.2 ggfun_0.0.8
## [221] colorspace_2.0-3 generics_0.1.3 base64enc_0.1-3 chron_2.3-58
## [225] gridtext_0.1.5 pillar_1.8.1 tweenr_2.0.2 InterSIM_2.2.0
## [229] uuid_1.1-0 GenomeInfoDbData_1.2.8 plyr_1.8.8 gtable_0.3.1
## [233] zip_2.2.2 bdsmatrix_1.3-6 rvest_1.0.3 restfulr_0.0.15
## [237] knitr_1.40 latticeExtra_0.6-30 shadowtext_0.1.2 biomaRt_2.52.0
## [241] fastmap_1.1.0 seriation_1.4.0 Cairo_1.6-0 doParallel_1.0.17
## [245] ca_0.71.1 broom_1.0.1 filelock_1.0.2 backports_1.4.1
## [249] lme4_1.1-31 hms_1.1.2 ggforce_0.4.1 KMsurv_0.1-5
## [253] polyclip_1.10-4 lazyeval_0.2.2 crayon_1.5.2 MASS_7.3-58.1
## [257] downloader_0.4 sparseMatrixStats_1.8.0 viridis_0.6.2 rpart_4.1.19
## [261] IntNMF_1.2.0 compiler_4.2.2 ggtext_0.1.2